{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "p9FfatPz6MU3"
   },
   "source": [
    "# **Homework 1: Linear Regression**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fsoaNwrZA0ui"
   },
   "source": [
    "本次目標：由前 9 個小時的 18 個 features (包含 PM2.5)預測的 10 個小時的 PM2.5。<!-- 可以參考 <link> 獲知更細項的作業說明。-->\n",
    "\n",
    "<!-- 首先，從 https://drive.google.com/open?id=1El0zvTkrSuqCTDcMpijXpADvJzZC2Jpa 將整個資料夾下載下來，並將下載下來的資料夾放到自己的 Google Drive（注意：上傳到自己 Google Drive 的是資料夾 hw1-regression，而非壓縮檔） -->\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "U7RiAkkjCc6l"
   },
   "source": [
    "# **Load 'train.csv'**\n",
    "train.csv 的資料為 12 個月中，每個月取 20 天，每天 24 小時的資料(每小時資料有 18 個 features)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1AfNX-hB3kN8"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from google.colab import drive \n",
    "!gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip\n",
    "!unzip data.zip\n",
    "# data = pd.read_csv('gdrive/My Drive/hw1-regression/train.csv', header = None, encoding = 'big5')\n",
    "data = pd.read_csv('./train.csv', encoding = 'big5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gqUdj00pDTpo"
   },
   "source": [
    "# **Preprocessing** \n",
    "取需要的數值部分，將 'RAINFALL' 欄位全部補 0。\n",
    "另外，如果要在 colab 重覆這段程式碼的執行，請從頭開始執行(把上面的都重新跑一次)，以避免跑出不是自己要的結果（若自己寫程式不會遇到，但 colab 重複跑這段會一直往下取資料。意即第一次取原本資料的第三欄之後的資料，第二次取第一次取的資料掉三欄之後的資料，...）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AIGP7XUYD_Yb"
   },
   "outputs": [],
   "source": [
    "data = data.iloc[:, 3:]\n",
    "data[data == 'NR'] = 0\n",
    "raw_data = data.to_numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "V7PCrVwX6jBF"
   },
   "source": [
    "# **Extract Features (1)**\n",
    "![圖片說明](https://drive.google.com/uc?id=1LyaqD4ojX07oe5oDzPO99l9ts5NRyArH)\n",
    "![圖片說明](https://drive.google.com/uc?id=1ZroBarcnlsr85gibeqEF-MtY13xJTG47)\n",
    "\n",
    "將原始 4320 * 18 的資料依照每個月分重組成 12 個 18 (features) * 480 (hours) 的資料。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HBnrGYXu9dZQ"
   },
   "outputs": [],
   "source": [
    "month_data = {}\n",
    "for month in range(12):\n",
    "    sample = np.empty([18, 480])\n",
    "    for day in range(20):\n",
    "        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]\n",
    "    month_data[month] = sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "WhVmtFEQ9D6t"
   },
   "source": [
    "# **Extract Features (2)**\n",
    "![alt text](https://drive.google.com/uc?id=1wKoPuaRHoX682LMiBgIoOP4PDyNKsJLK)\n",
    "![alt text](https://drive.google.com/uc?id=1FRWWiXQ-Qh0i9tyx0LiugHYF_xDdkhLN)\n",
    "\n",
    "每個月會有 480hrs，每 9 小時形成一個 data，每個月會有 471 個 data，故總資料數為 471 * 12 筆，而每筆 data 有 9 * 18 的 features (一小時 18 個 features * 9 小時)。\n",
    "\n",
    "對應的 target 則有 471 * 12 個(第 10 個小時的 PM2.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dcOrC4Fi-n3i"
   },
   "outputs": [],
   "source": [
    "x = np.empty([12 * 471, 18 * 9], dtype = float)\n",
    "y = np.empty([12 * 471, 1], dtype = float)\n",
    "for month in range(12):\n",
    "    for day in range(20):\n",
    "        for hour in range(24):\n",
    "            if day == 19 and hour > 14:\n",
    "                continue\n",
    "            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day * 24 + hour + 9].reshape(1, -1) #vector dim:18*9 (9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9)\n",
    "            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9] #value\n",
    "print(x)\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1wOii0TX8IwE"
   },
   "source": [
    "# **Normalize (1)**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ceMqFoNI8ftQ"
   },
   "outputs": [],
   "source": [
    "mean_x = np.mean(x, axis = 0) #18 * 9 \n",
    "std_x = np.std(x, axis = 0) #18 * 9 \n",
    "for i in range(len(x)): #12 * 471\n",
    "    for j in range(len(x[0])): #18 * 9 \n",
    "        if std_x[j] != 0:\n",
    "            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "NzvXP5Jya64j"
   },
   "source": [
    "#**Split Training Data Into \"train_set\" and \"validation_set\"**\n",
    "這部分是針對作業中 report 的第二題、第三題做的簡單示範，以生成比較中用來訓練的 train_set 和不會被放入訓練、只是用來驗證的 validation_set。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "feF4XXOQb5SC"
   },
   "outputs": [],
   "source": [
    "import math\n",
    "x_train_set = x[: math.floor(len(x) * 0.8), :]\n",
    "y_train_set = y[: math.floor(len(y) * 0.8), :]\n",
    "x_validation = x[math.floor(len(x) * 0.8): , :]\n",
    "y_validation = y[math.floor(len(y) * 0.8): , :]\n",
    "print(x_train_set)\n",
    "print(y_train_set)\n",
    "print(x_validation)\n",
    "print(y_validation)\n",
    "print(len(x_train_set))\n",
    "print(len(y_train_set))\n",
    "print(len(x_validation))\n",
    "print(len(y_validation))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Q-qAu0KR_ZRR"
   },
   "source": [
    "# **Training**\n",
    "![alt text](https://drive.google.com/uc?id=1xIXvqZ4EGgmxrp7c9r0LOVbcvd4d9H4N)\n",
    "![alt text](https://drive.google.com/uc?id=1S42g06ON5oJlV2f9RukxawjbE4NpsaB6)\n",
    "![alt text](https://drive.google.com/uc?id=1BbXu-oPB9EZBHDQ12YCkYqtyAIil3bGj)\n",
    "\n",
    "(和上圖不同處: 下面的 code 採用 Root Mean Square Error)\n",
    "\n",
    "因為常數項的存在，所以 dimension (dim) 需要多加一欄；eps 項是避免 adagrad 的分母為 0 而加的極小數值。\n",
    "\n",
    "每一個 dimension (dim) 會對應到各自的 gradient, weight (w)，透過一次次的 iteration (iter_time) 學習。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "cCzDfxBFBFqp"
   },
   "outputs": [],
   "source": [
    "dim = 18 * 9 + 1\n",
    "w = np.zeros([dim, 1])\n",
    "x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)\n",
    "learning_rate = 100\n",
    "iter_time = 1000\n",
    "adagrad = np.zeros([dim, 1])\n",
    "eps = 0.0000000001\n",
    "for t in range(iter_time):\n",
    "    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse\n",
    "    if(t%100==0):\n",
    "        print(str(t) + \":\" + str(loss))\n",
    "    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1\n",
    "    adagrad += gradient ** 2\n",
    "    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)\n",
    "np.save('weight.npy', w)\n",
    "w"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZqNdWKsYBK28"
   },
   "source": [
    "# **Testing**\n",
    "![alt text](https://drive.google.com/uc?id=1165ETzZyE6HStqKvgR0gKrJwgFLK6-CW)\n",
    "\n",
    "載入 test data，並且以相似於訓練資料預先處理和特徵萃取的方式處理，使 test data 形成 240 個維度為 18 * 9 + 1 的資料。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "AALygqJFCWOA"
   },
   "outputs": [],
   "source": [
    "# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')\n",
    "testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')\n",
    "test_data = testdata.iloc[:, 2:]\n",
    "test_data[test_data == 'NR'] = 0\n",
    "test_data = test_data.to_numpy()\n",
    "test_x = np.empty([240, 18*9], dtype = float)\n",
    "for i in range(240):\n",
    "    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)\n",
    "for i in range(len(test_x)):\n",
    "    for j in range(len(test_x[0])):\n",
    "        if std_x[j] != 0:\n",
    "            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]\n",
    "test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)\n",
    "test_x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dJQks9JEHR6W"
   },
   "source": [
    "# **Prediction**\n",
    "說明圖同上\n",
    "\n",
    "![alt text](https://drive.google.com/uc?id=1165ETzZyE6HStqKvgR0gKrJwgFLK6-CW)\n",
    "\n",
    "有了 weight 和測試資料即可預測 target。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "jNyB229jHsEQ"
   },
   "outputs": [],
   "source": [
    "w = np.load('weight.npy')\n",
    "ans_y = np.dot(test_x, w)\n",
    "ans_y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HKMKW7RzHwuO"
   },
   "source": [
    "# **Save Prediction to CSV File**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Dwfpqqy0H8en"
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "with open('submit.csv', mode='w', newline='') as submit_file:\n",
    "    csv_writer = csv.writer(submit_file)\n",
    "    header = ['id', 'value']\n",
    "    print(header)\n",
    "    csv_writer.writerow(header)\n",
    "    for i in range(240):\n",
    "        row = ['id_' + str(i), ans_y[i][0]]\n",
    "        csv_writer.writerow(row)\n",
    "        print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Y54yWq9cIPR4"
   },
   "source": [
    "相關 reference 可以參考:\n",
    "\n",
    "Adagrad :\n",
    "https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705 \n",
    "\n",
    "RMSprop : \n",
    "https://www.youtube.com/watch?v=5Yt-obwvMHI \n",
    "\n",
    "Adam\n",
    "https://www.youtube.com/watch?v=JXQT_vxqwIs \n",
    "\n",
    "\n",
    "以上 print 的部分主要是為了看一下資料和結果的呈現，拿掉也無妨。另外，在自己的 linux 系統，可以將檔案寫死的的部分換成 sys.argv 的使用 (可在 terminal 自行輸入檔案和檔案位置)。\n",
    "\n",
    "最後，可以藉由調整 learning rate、iter_time (iteration 次數)、取用 features 的多寡(取幾個小時，取哪些特徵欄位)，甚至是不同的 model 來超越 baseline。\n",
    "\n",
    "Report 的問題模板請參照 : https://docs.google.com/document/d/1s84RXs2AEgZr54WCK9IgZrfTF-6B1td-AlKR9oqYa4g/edit"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "hw1_regression.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
